Skip to content

Conversation

@joostjager
Copy link
Contributor

Update MonitorUpdatingPersister and MonitorUpdatingPersisterAsync to queue persist operations in memory instead of writing immediately to disk. The Persist trait methods now return ChannelMonitorUpdateStatus:: InProgress and the actual writes happen when flush() is called.

This fixes a race condition that could cause channel force closures: previously, if the node crashed after writing channel monitors but before writing the channel manager, the monitors would be ahead of the manager on restart. By deferring monitor writes until after the channel manager is persisted (via flush()), we ensure the manager is always at least as up-to-date as the monitors.

Key changes:

  • Add PendingWrite enum to represent queued write/remove operations
  • Add pending_writes queue to MonitorUpdatingPersisterAsyncInner
  • Add flush() to Persist trait and ChainMonitor
  • Update Persist impl to queue writes and return InProgress
  • Call flush() in background processor after channel manager persistence
  • Remove unused event_notifier from AsyncPersister

@ldk-reviews-bot
Copy link

👋 Hi! I see this is a draft PR.
I'll wait to assign reviewers until you mark it as ready for review.
Just convert it out of draft status when you're ready for review!

use core::mem;
use core::ops::Deref;
use core::pin::{pin, Pin};
use core::pin::pin;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should be able to do this without touching persist.rs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where would we do it then? Queue in ChainMonitor, in KVStore, or somewhere else?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ChainMonitor would need to store the actual monitor data to defer writes, not just track update IDs as it does now. This means either cloning expensive ChannelMonitor objects or storing serialized bytes, which leaks persistence format details into ChainMonitor?

@joostjager joostjager self-assigned this Jan 15, 2026
@joostjager joostjager force-pushed the mon-barrier branch 3 times, most recently from 93ff6c9 to 181a6e0 Compare January 16, 2026 09:40
@joostjager
Copy link
Contributor Author

ldk-node's channel_full_cycle test now passes with deferred monitor writes. This validates that channel open, bidirectional payments, and close all work correctly when persist methods return InProgress and actual writes happen on flush(). The critical piece is calling channel_monitor_updated after each flush to unblock the channel - without this, channels would stay paused forever waiting for persistence confirmation.

Not yet tested: node restart scenarios where the manager/monitor ordering invariant matters, which is the primary motivation for this change.

Update MonitorUpdatingPersister and MonitorUpdatingPersisterAsync to
queue persist operations in memory instead of writing immediately to
disk. The Persist trait methods now return ChannelMonitorUpdateStatus::
InProgress and the actual writes happen when flush() is called.

This fixes a race condition that could cause channel force closures:
previously, if the node crashed after writing channel monitors but
before writing the channel manager, the monitors would be ahead of
the manager on restart. By deferring monitor writes until after the
channel manager is persisted (via flush()), we ensure the manager is
always at least as up-to-date as the monitors.

The flush() method takes a count parameter specifying how many queued
writes to flush. The background processor captures the queue size
before persisting the channel manager, then flushes exactly that many
writes afterward. This prevents flushing monitor updates that arrived
after the manager state was captured.

Key changes:
- Add PendingWrite enum with FullMonitor and Update variants for queued writes
- Add pending_writes queue to MonitorUpdatingPersisterAsyncInner
- Add pending_write_count() and flush(count) to Persist trait and ChainMonitor
- ChainMonitor::flush() calls channel_monitor_updated for each completed write
- Stale update cleanup happens in flush() after full monitor is written
- Call flush() in background processor after channel manager persistence

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@joostjager
Copy link
Contributor Author

The ldk-node payment benchmark (1000 payments, 10 iterations) shows a ~14% slowdown with deferred writes:

main: 9.65s median
deferred: 10.96s median

This is expected since we now queue writes in memory and flush them after channel manager persistence, adding overhead to the persist cycle. The tradeoff is correctness - ensuring monitors are never ahead of the manager on disk after a crash.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

3 participants